
Estimate partition memory usage based on previous attempts #11857

Merged
merged 5 commits into trinodb:master on Apr 9, 2022

Conversation

losipiuk
Member

@losipiuk losipiuk commented Apr 7, 2022

Description

Estimate partition memory usage based on previous attempts.
This applies to execution with task-level retries when the bin-packing node allocator is selected.

Is this change a fix, improvement, new feature, refactoring, or other?

improvement

Is this a change to the core query engine, a connector, client library, or the SPI interfaces? (be specific)

core engine

Documentation

(x) No documentation is needed.
( ) Sufficient documentation is included in this PR.
( ) Documentation PR is available with #prnumber.
( ) Documentation issue #issuenumber is filed, and can be handled later.

Release notes

(x) No release notes entries required.
( ) Release notes entries required with the following suggested text:

@cla-bot cla-bot bot added the cla-signed label Apr 7, 2022
@losipiuk losipiuk requested review from arhimondr and linzebing April 7, 2022 20:43
@losipiuk
Member Author

losipiuk commented Apr 7, 2022

@arhimondr do we need RN for this?

Contributor

@arhimondr arhimondr left a comment


LGTM % comments

@losipiuk losipiuk force-pushed the lo/smarter-partition-sizing branch from bd5e8b0 to 835f201 on April 7, 2022 22:36
@losipiuk
Member Author

losipiuk commented Apr 7, 2022

Rebased on top of #11861

@losipiuk losipiuk force-pushed the lo/smarter-partition-sizing branch from 835f201 to f7ea0b7 on April 7, 2022 22:57
@losipiuk
Member Author

losipiuk commented Apr 7, 2022

AC

@github-actions github-actions bot added the docs label Apr 7, 2022
package io.trino.execution.scheduler;

@FunctionalInterface
public interface PartitionMemoryEstimatorFactory
Member

QQ: why is having a factory beneficial here?

Member Author

We want a new instance for each stage, so we can base estimates on the tasks that completed for that particular stage.
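
For illustration, here is a minimal sketch of the factory-per-stage idea described in the reply above. The interface and method names follow the quoted snippet, but the exact shape (the placeholder PartitionMemoryEstimator and the no-argument create method) is an assumption for this sketch, not necessarily the real Trino API:

package io.trino.execution.scheduler;

// Sketch: placeholder for the per-stage estimator produced by the factory.
interface PartitionMemoryEstimator
{
    // e.g. record finished/failed partitions and estimate memory for new ones
}

@FunctionalInterface
interface PartitionMemoryEstimatorFactory
{
    // Invoked once per stage, so the returned estimator accumulates statistics
    // only from tasks that ran for that particular stage.
    PartitionMemoryEstimator createPartitionMemoryEstimator();
}

With this shape, each stage's scheduler calls the factory once and keeps its own estimator instance, so memory statistics never mix across stages.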

// take previousRequiredBytes into account when registering failure on oom. It is conservative hence safer (and in-line with getNextRetryMemoryRequirements)
long previousRequiredBytes = previousMemoryRequirements.getRequiredMemory().toBytes();
long previousPeakBytes = peakMemoryUsage.toBytes();
memoryUsageDistribution.add(Math.max(previousRequiredBytes, previousPeakBytes) * growthFactor);
Member

Why would we add estimated memory usage to the distribution? If the retry succeeds, the actual usage will be added anyway, right? This seems to skew the metric.

I understand your intention, though: if one task consumes a large amount of memory, then other tasks may also need a large amount of memory. But this makes the stats collection less accurate; maybe we should explore some other approach instead.

Member Author

Yeah - this is surely not an exact science and I am not sure how well it will work in practice. But the intention is exactly what you wrote: if we see that tasks are dying because we gave them too little memory, we want to bump the initial memory for new tasks right away, rather than wait until one succeeds (it may take a long time before we have one).

At first I was thinking about keeping two separate histograms for successful and unsuccessful tries, and making the one for unsuccessful tries decay over time so that newer data matters more - but I did not come up with a reasonable way to merge the data from both, so I implemented the simple (though, I agree, not 100% bullet-proof) approach.

Happy to hear suggestions on how to improve it though :)

BTW: I will add a commit on top with extra debug logging so we can see how it works in practice when testing queries on a cluster.
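
For illustration, here is a minimal, self-contained sketch of the scheme discussed in this thread. The class and method names are hypothetical, and the growth factor and percentile constants are assumptions for the sketch; only the handling of failed attempts mirrors the quoted snippet above:

import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

class MemoryUsageEstimatorSketch
{
    private static final double GROWTH_FACTOR = 2.0;       // assumed value, configurable in practice
    private static final double ESTIMATE_PERCENTILE = 0.9; // assumed value, configurable in practice

    // Observed (or projected) per-task memory usage for one stage.
    private final List<Long> memoryUsageDistribution = new ArrayList<>();

    // Successful attempt: record the actual peak memory usage.
    void recordSuccess(long peakMemoryBytes)
    {
        memoryUsageDistribution.add(peakMemoryBytes);
    }

    // Failed (OOM) attempt: record what the retry will be asked to reserve, i.e. the
    // larger of the previous requirement and the observed peak, scaled by the growth
    // factor. This biases future initial estimates upward before any retry succeeds.
    void recordOomFailure(long previousRequiredBytes, long peakMemoryBytes)
    {
        memoryUsageDistribution.add((long) (Math.max(previousRequiredBytes, peakMemoryBytes) * GROWTH_FACTOR));
    }

    // Initial estimate for a new task: a high percentile of everything seen so far,
    // falling back to a default when there is no data yet.
    long estimateInitialMemory(long defaultBytes)
    {
        if (memoryUsageDistribution.isEmpty()) {
            return defaultBytes;
        }
        List<Long> sorted = new ArrayList<>(memoryUsageDistribution);
        Collections.sort(sorted);
        int index = (int) Math.ceil(ESTIMATE_PERCENTILE * sorted.size()) - 1;
        return sorted.get(Math.max(index, 0));
    }
}

The trade-off raised above is visible here: recordOomFailure pushes a projected value into the same distribution as real measurements, so initial estimates for new tasks rise quickly after failures, at the cost of some skew in the collected statistics.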

Member

Hmmm. Actually I guess with some tuning your approach might work well in practice. We can leave it as it is for now.

@losipiuk losipiuk force-pushed the lo/smarter-partition-sizing branch from f7ea0b7 to 6ea2208 on April 8, 2022 13:04
@losipiuk losipiuk force-pushed the lo/smarter-partition-sizing branch from 6ea2208 to 57457cf on April 8, 2022 17:24
@losipiuk losipiuk merged commit 6fba6a4 into trinodb:master Apr 9, 2022
@github-actions github-actions bot added this to the 377 milestone Apr 9, 2022